Abstract

The LOCKSS digital preservation system 
collects content by crawling the web and pre- 
serves it in the format supplied by the publisher. 
Eventually, browsers will no longer understand 
that format. A process called format migration 
converts it to a newer format that the browsers do 
understand. The LOCKSS program has designed 
and tested an initial implementation of format 
migration for Web content that is transparent to 
readers, building on the content negotiation ca- 
pabilities of HTTP. 

1 Introduction 

Eventually, any format in which digital con- 
tent might be stored will become obsolete. A for- 
mat is said to be obsolete when current hardware 
and software are no longer able to render infor- 
mation represented in it understandable to read- 
ers. The design of digital preservation systems 
must anticipate this obsolescence, and incorporate
a strategy by which the content they preserve will 
still be understood by readers after multiple gen- 
erations of formats have become obsolete. Two 
such strategies have been identified: 

Emulation in which the content is both preserved
and presented to readers in the original
format [10].

Migration in which the content is presented in 
a current format; it may be preserved in a 
succession of current formats or in the orig- 
inal format which is transformed on request 
into the current format for presentation [H]. 

Some software business models depend on a 
rapid upgrade cycle. In these areas rapid format 
change is normal; users who do upgrade produce 
a format users who have yet to upgrade cannot 
interpret. This is a powerful motivation for further
upgrades, and thus a powerful income generator.
But note that rapid format change doesn't
imply rapid format obsolescence. An upgraded
application that didn't accept old formats would
not be an effective income generator.

We provide an overview of the problem of
format obsolescence as applied to Web content
and, in this context, examine possible implementations
of the two strategies. We identify the
practical difficulties that face any implementation
of emulation; they led us to choose the migration
strategy. We describe the design and implementation
of a transparent, on-access format
migration capability for the LOCKSS¹ system for
preserving Web content.

Our implementation is capable of transpar- 
ently presenting content collected in one Web for- 
mat to readers in another Web format, with no 
changes needed to browsers. The reader need 
take no special action to cause this to happen, 
nor even be aware that it is happening. This ap- 
pears to be the first time that a production digi- 
tal preservation system has demonstrated trans- 
parent format migration of live content collected 
from the Web for end-users. 

2 Format Obsolescence of Web Content 

A Web format may be said to be obsolescent 
when widely-used browsers are no longer able to 
present content in the format to their readers. 

To the casual observer it may appear that
the format in which Web content is supplied is
solely determined by the Web server, possibly by
the file name extension. In fact, the format is determined
by one of a set of mechanisms for content
negotiation defined in Section 12 of [5], which
are capable of negotiating format, language, and
encoding. The mechanism for format negotiation
uses the Accept: header defined in Section 14.1
of [5]. A browser sends this to the server with
a list of acceptable Mime-Type values, each with
a numeric preference value between 0 and 1. If
it is capable of supplying the requested content
in multiple formats the server MAY decide on the
basis of this list and its preference values which
format to supply. Browsers determine the format
in which the server has decided to supply
the content they requested using the Content-Type:
header.

¹ LOCKSS is a trademark of Stanford University. It
stands for Lots Of Copies Keep Stuff Safe.

In practice, a browser does not know when 
it issues a request for a URL whether it refers to 
text, audio, video or some other class of object. 
Browsers therefore send a default Accept: header 
on most GET requests specifying their preferred 
Mime-Type values. These lists typically include a
low-preference default */*;q=0.1 saying, in effect,
"if you can't give me what I want, give me what
you have". Because browsers indicate in this way
their willingness to receive any format, there is
some difficulty in determining when obsolescence
has occurred.
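
To make the structure of such a header concrete, the following minimal sketch splits an Accept: header into media ranges and preference values. The function name parse_accept and the example header are illustrative assumptions, not part of the LOCKSS code or of any particular browser; a missing q parameter is treated as a preference of 1.

```python
def parse_accept(header):
    """Parse an HTTP Accept: header into (media_range, q) pairs, best first."""
    entries = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        parts = [p.strip() for p in item.split(";")]
        media_range = parts[0].lower()
        q = 1.0  # a missing q parameter means full preference
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: an unparsable q is treated as "not acceptable"
        entries.append((media_range, q))
    # sorted() is stable, so equal preferences keep their header order
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


# Hypothetical browser default, ending with the low-preference catch-all.
header = "text/html,image/png,image/jpeg;q=0.8,*/*;q=0.1"
for media_range, q in parse_accept(header):
    print(media_range, q)
```

The final */* entry is what lets a server fall back to whatever format it holds, which is why the header alone rarely reveals that a format has become unusable.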

In brief, the problem of format obsolescence 
for a system preserving Web content is that of 
what to do when it receives a request for some 
content that was collected in format F/G, say im- 
age/gif, whose Accept: header indicates F/G is not 
an acceptable format. 

In the light of the */*;q=0.1 usage, there are
two ways in which this can happen: the browser
can explicitly signal it or the server can be configured
to assume it. Although Section 12 of [5] does
not define the semantics of a 0 preference value,
it appears that servers treat this as an instruction
not to send content in that Mime-Type. Thus
even a browser that uses */*;q=0.1 can flag a format
as unacceptable by sending F/G;q=0. Alternatively,
a server could be configured not to recognize F/G
as matching */*.
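
Both routes can be sketched as a single acceptability test. This is an illustration under stated assumptions, not the LOCKSS matching code: the names is_acceptable and no_wildcard are hypothetical, the most specific matching media range is taken to win, and a q value of 0 is treated as a refusal.

```python
def parse_accept(header):
    """Return a dict mapping each media range in an Accept: header to its q value."""
    prefs = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
        prefs[parts[0].lower()] = q
    return prefs


def is_acceptable(mime_type, accept_header, no_wildcard=frozenset()):
    """Decide whether stored content of type mime_type may be sent unchanged.

    no_wildcard is a hypothetical server-side setting: types listed in it
    never match F/* or */*, modelling a server configured to assume that
    a format is obsolete.
    """
    prefs = parse_accept(accept_header)
    mime_type = mime_type.lower()
    if mime_type in prefs:
        return prefs[mime_type] > 0  # explicit entry; q=0 is a refusal
    if mime_type in no_wildcard:
        return False  # server assumes obsolescence
    if not prefs:
        return True  # no Accept: header means every type is acceptable
    major = mime_type.split("/")[0]
    # Fall back to the wildcards, most specific first: F/* and then */*.
    for media_range in (major + "/*", "*/*"):
        if media_range in prefs:
            return prefs[media_range] > 0
    return False


browser = "text/html,image/png,*/*;q=0.1"
print(is_acceptable("image/gif", browser))                               # True
print(is_acceptable("image/gif", browser + ",image/gif;q=0"))            # False
print(is_acceptable("image/gif", browser, no_wildcard={"image/gif"}))    # False
```

The second call shows the browser signalling obsolescence explicitly; the third shows the server assuming it without any change to the browser.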

Fortunately, since Web browsers and their 
plug-ins are normally free, there are few incen- 
tives for rapid format change and particularly 
obsolescence. No-one clamors to remove support 
for an old Web format; it is valuable so long as 
there is content on Web servers that has not yet 
been migrated to a current format. There are no 
good ways to motivate small Web sites to per- 
form this migration, so old Web formats die a 
very slow death. From the viewpoint of digital 
preservation, this makes Web content easier to 
handle; there will be plenty of time to implement 
a format migration. 



2.1 Emulation of Obsolete Web Formats

The goal of the emulation preservation 
strategy is to avoid the loss of fidelity that is 
likely to result from converting content from one 
format to another. If the content is preserved in 
its original format and presented to the reader 
in that format, no conversion is needed. What 
is needed is the ability of a future reader to run 
the software the original reader would have run 
to experience the content. The emulation strat- 
egy seeks to provide that by preserving the orig- 
inal software as well as the content, and provid- 
ing the future reader with a software emulation 
of the environment needed to run the original 
software to interpret the preserved content in its 
original format. In a suitable context the emu- 
lation strategy is attractive; it is being pursued, 
for example, by a collaboration between IBM [7]
and the Koninklijke Bibliotheek (KB, Dutch Na- 
tional Library) JI] which has built a PDF inter- 
preter that runs on a Universal Virtual Computer 
(UVC, a virtual machine designed to be easy to 
port to future environments). The terms under 
which the KB preserves content make this appro- 
priate; they mandate that it be accessed only at 
the KB, where deployment of the UVC is easy. 

In the Web context emulation means that 
a future reader wishing to read a preserved Web 
page that contains some content in an obsolete 
format must somehow find out the approximate 
date of the original content, then locate a pre- 
served browser or plug-in of that date, and the 
appropriate emulation needed to allow that pre- 
served browser or plug-in to run in the reader's 
current computing environment. The reader 
must then invoke this emulation to run the pre- 
served browser or plug-in to view the Web page. 

Since in the emulation strategy all these ac- 
tivities take place in the reader's environment, 
there is little the preservation system can do to 
enable them. It has no control over the reader's 
environment. Indeed, if it is disseminating the 
preserved content by acting as a Web server, it 
will have almost no knowledge of the reader's environment.
Although the effect of a successful emulation
strategy would be to prevent the preservation
system ever seeing a request with F/G;q=0
in its Accept: header, the practical difficulties both
in implementing the emulation of instruction
sets, operating systems, etc. and in deploying
the appropriate emulation and the appropriate
preserved browser or plug-in to the appropriate
reader are formidable.

2.2 Migration of Obsolete Web Formats 

Migration of Web content from an obsolete 
format to a current one can take place at any time 
between the point at which the content is col- 
lected to the point at which the reader requests 
access to it. We examine three points that have 
been implemented, from the earliest to the latest. 

2.2.1 Migration On Ingest 

The National Archives of Australia (NAA), 
faced with a requirement to preserve vast vol- 
umes of government information in a wide variety 
of mostly proprietary formats, chose a strategy 
of migration on ingest [6]. They pre-emptively
migrate the content they receive into one of a 
small number of carefully chosen formats before 
preserving it. If their choices turn out well, this 
pragmatic approach has significant advantages: 

• It can postpone the need for future migra- 
tion for a long time, allowing both economies 
in operation and the use of better, future 
technology for performing the next migra- 
tion. 

• It can greatly reduce the cost of eventual 
future migrations by reducing the number 
of formats to be migrated. 

Both these advantages are greatly enhanced 
if the formats chosen are open standards and are 
supported by open source software, as they are 
in the case of NAA. Most of the material NAA 
handles is not from the Web, and most Web for- 
mats would meet their criteria without an initial 
migration. 

The disadvantages of this approach are two- 
fold. First, it does not fully satisfy the require- 
ments of archivists, because the content is not 
preserved in its original form². Some potentially
useful information may be lost in the initial mi- 
gration. Second, it postpones the format migra- 
tion problem but does not actually solve it. Even 
the chosen formats will eventually become obso- 
lete. 



² NAA actually does preserve both the original and the
migrated format but expects that access to the original
will be an exception.



2.2.2 Batch Migration

When a format in which some content is 
being preserved is thought likely to become ob- 
solete, a batch migration process can be preemp- 
tively undertaken. The preserved content in the 
obsolete format is converted to a current format 
en masse. Some stand-alone tools for doing so 
have been developed ^21 but they have yet to 
be integrated into a complete digital preserva- 
tion system. When a reader requests access to 
the preserved content, the result of the conver- 
sion will be delivered. 

The DAITSS (Dark Archive In The Sun- 
shine State) is designed to use a batch migration 
strategy U 3 . As a dark archive, one not in- 
tended to be accessed by readers but maintained 
in a controlled environment, this is an appropri- 
ate solution. The archive has total control of the 
environment and no urgent demands from end- 
users to satisfy. 

³ DAITSS also preserves both the original and the
migrated format.

2.2.3 On-Access Migration

The alternative migration approach, migration
on access, postpones format migration until
the reader actually requests the preserved content.
It avoids the disadvantages of the other
migration strategies by preserving the content in
its original formats. When a format is thought
likely to become obsolete, the digital preservation
system is enhanced with the ability to present
the reader, upon request, with the requested content
in a current format. In effect, the migration
tool is integrated into the dissemination pipeline
of the preservation system rather than being applied
to the preserved content.

This approach requires the ability to convert
dynamically from the obsolete to the current
format, but it offers significant advantages:

• Content is preserved in its original format,
satisfying the archivists' requirements and
avoiding the risk of information loss from
buggy format convertors. This risk is clearly
enough to motivate systems using other migration
strategies to hedge their bets by preserving
the original format too.

• Preserved content is migrated by the most
recent, and presumably best, technology
available at the time the reader requests access.

• Preserved content is rarely accessed, so
delaying format migration until it is actually
required reduces the resource cost of the
process by the proportion that is never accessed,
and by the decreasing cost of technology
through time.

• Content can be migrated directly from the 
original to the current format, minimizing 
the effects of format conversion artefacts. 

• The format converters, once developed, can 
themselves be preserved to document the 
original format. Note that a converter can 
be developed pre-emptively before the for- 
mat goes obsolete and preserved against fu- 
ture need if the format's longevity is suspect. 

• As with other migration strategies, careful 
choice of the format to migrate to can greatly 
reduce the need for and cost of future migra- 
tions. 

The disadvantages of the migration on access
strategy are that dynamic format migration
may impose significant delays on readers' accesses
to preserved material, and that it requires
close integration with the dissemination pipeline
delivering the digital preservation system's preserved
content to its readers.

3 Format Migration in the LOCKSS System

The LOCKSS system provides librarians 
with a simple, low-cost tool they can use to en- 
sure their community's continuing access to ma- 
terial published on the Web |Hj • It is designed to 
handle both for-fee subscription e-journals and 
open access material whose copyright is held by 
the publisher, not the library's institution. Li- 
braries run LOCKSS peers, low-cost PCs running 
free, open source software that: 

Collects the material to be preserved by crawl- 
ing the publisher's web site, after verifying 
that the publisher has granted suitable per- 
mission. 

Preserves the material by cooperating with 
other peers holding the same material in a 
mutual audit process by running polls to 
identify any missing or damaged content and 
repair it. 



Disseminates the preserved material by acting 
as a proxy cache, intercepting requests from 
the library's browsers for the original URL 
from which the material was collected. If 
the publisher's copy is still available, it is 
delivered. Otherwise the preserved copy is 
delivered. 
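
The dissemination rule can be sketched in a few lines. This is a simplified illustration, not the LOCKSS proxy itself: the names disseminate, fetch_from_publisher and preserved are hypothetical, and any I/O failure is assumed to mean that the publisher's copy is unavailable.

```python
def disseminate(url, fetch_from_publisher, preserved):
    """Return (source, body) for url, preferring the publisher's live copy."""
    try:
        return "publisher", fetch_from_publisher(url)
    except OSError:
        # Assumption: any I/O failure counts as "no longer available".
        pass
    if url in preserved:
        return "preserved", preserved[url]
    raise KeyError(url)  # neither the publisher nor the peer holds the content


def unreachable(url):
    """Stand-in for a publisher site that has gone away."""
    raise OSError("publisher site unreachable")


store = {"http://example.org/article.gif": b"preserved bytes"}
print(disseminate("http://example.org/article.gif", unreachable, store))
```

Because the fallback happens at the original URL, the reader's links and bookmarks keep working whichever copy is served.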

The LOCKSS system was released for pro- 
duction use in April 2004 and about 80 libraries 
world-wide now use it. Publishers of over 2000 
titles have endorsed the system. 

3.1 Design 

In the LOCKSS system it is natural to 
use the migration on access strategy. Preserv- 
ing content in the original format greatly simpli- 
fies the mutual audit and repair process, and the 
LOCKSS system already implements the com- 
plete dissemination pipeline into which the mi- 
gration process must be integrated. 

The LOCKSS system is being enhanced to 
provide: 

• An API for plug-in format convertors, by
which they can register their input and output
Mime-Type values, and by which the
LOCKSS web proxy code can invoke them
to perform on-the-fly conversion.

• A matching process which takes the Accept: 
header of incoming requests and compares it 
to the original format of the preserved con- 
tent. If the original format is not acceptable, 
the matching process searches the table of 
registered format convertors looking for one 
which takes the original format as input and 
whose output format is acceptable. If a suitable
convertor is not found, a 406 error is returned
as required by Section 10.4.7 of [5].

• A distributed registry of convertors, similar
to the distributed registry of the plug-ins
that adapt the LOCKSS system to particular
content. These registries treat Java
classes exactly as other Web content, collecting
them by crawling the Web and preserving
them by mutual audit and repair.
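
The first two enhancements can be sketched together as follows. This is a minimal illustration of the idea, not the LOCKSS API, which is implemented in Java: the names ConverterRegistry, register, find and respond are hypothetical, the acceptable argument stands for whatever test is derived from the request's Accept: header, and the registered convertor is a placeholder that does not really re-encode an image.

```python
class ConverterRegistry:
    """Table of format convertors keyed by (input type, output type)."""

    def __init__(self):
        self._converters = {}

    def register(self, input_type, output_type, convert):
        """Record that convert(bytes) -> bytes maps input_type to output_type."""
        self._converters[(input_type, output_type)] = convert

    def find(self, input_type, acceptable):
        """Return (output_type, convert) for the first acceptable output, or None."""
        for (src, dst), convert in self._converters.items():
            if src == input_type and acceptable(dst):
                return dst, convert
        return None


def respond(stored_type, body, acceptable, registry):
    """Return (status, content type, body) for a request for preserved content."""
    if acceptable(stored_type):
        return 200, stored_type, body        # original format is still acceptable
    match = registry.find(stored_type, acceptable)
    if match is None:
        return 406, None, b""                # Not Acceptable: no suitable convertor
    output_type, convert = match
    return 200, output_type, convert(body)   # migrate on access


def accepts_png_only(mime_type):
    """Stand-in for a test derived from the request's Accept: header."""
    return mime_type == "image/png"


registry = ConverterRegistry()
# Placeholder convertor: a real one would decode the GIF and encode a PNG.
registry.register("image/gif", "image/png", lambda data: b"PNG:" + data)

print(respond("image/gif", b"gif-bytes", accepts_png_only, registry))
print(respond("image/tiff", b"tiff-bytes", accepts_png_only, registry))
```

Keeping the convertors behind a registry means the proxy code needs no knowledge of individual formats; support for a newly obsolete format is added by registering one more convertor.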

3.2 Proof-of-Concept Implementation 

To confirm the feasibility of this design, 
a proof-of-concept was implemented and tested. 
We chose an "obsolete" format widely used
in actual content collected by the production
LOCKSS system, and a suitable "current" for- 
mat to replace it. The obsolete format was 
GIF PP, an old format for images. The GIF for- 
mat has been deprecated by many open source 
advocates for reasons connected with intellectual 
property restrictions, and they have developed 
the PNG [3] format as a replacement. This
background makes our assessment less artificial, as
the format migration in question has been ac- 
tively solicited. Tools for converting from GIF to 
PNG are widely available, as would be expected 
if a widely used Web format were to become ob- 
solete. 

We did not implement the full Mime-Type
matching process, but rather a configuration op- 
tion that prevented image/gif from matching any 
Accept: header. The mis-match triggered a GIF- 
to-PNG conversion directly, delivering the con- 
tent converted to PNG at the original URL but 
with Mime-Type=image/png. 

3.3 Assessment 

As can be seen from the before (Figure 1)
and after (Figure 2) screen-shots, this format migration
is not perceptible to the user. Nor does
GIF-to-PNG format migration incur a noticeable 
delay in accessing the page. 

4 Future Work 

The next step is to replace the proof-of- 
concept implementation by a full implementation 
of the API for plug-in format convertors, and a
broader set of convertors than just GIF-to-PNG. 
This implementation will need a more realistic- 
scale test, and we are arranging to conduct one. 

Another approach would be to connect the 
API to a format migration service such as TOM 
(Typed Object Model) 0. We are investigating 
this possibility. 

The development of future format convertors
will be significantly easier if more, and
more reliable, format metadata is available. We
are working towards incorporating Harvard's 
JHOVE % format metadata extraction and val- 
idation technology to improve the quality of the 
format metadata in the LOCKSS system. 

5 Conclusion 

We have designed, implemented as a proof-of-concept,
and demonstrated transparent format
migration on access for the LOCKSS digital
preservation system. By doing so we have validated
one of the possible format migration strategies,
and reassured the community of LOCKSS
users that when the time comes the content they
are preserving will remain accessible despite the
obsolescence of the formats in which it was collected.

Figure 1: A GIF image from an article in Journal
of Histochemistry and Cytochemistry before the
simulated obsolescence of GIF. Note the header.

Figure 2: The same image after the simulated
obsolescence of GIF. The header shows that the
GIF file has been converted to PNG.